Abstract. Correlation of gene histories in the human genome determines the
patterns of genetic variation (haplotype structure) and is crucial to understanding 
genetic factors in common diseases. We derive closed analytical expressions for the 
correlation of gene histories in established demographic models for genetic evolution 
and show how to extend the analysis to more realistic (but more complicated) models 
of demographic structure. We identify two contributions to the correlation of gene 
histories in divergent populations: linkage disequilibrium, and differences in the 
demographic history of individuals in the sample. These two factors contribute to 
correlations at different length scales: the former at small, and the latter at large 
scales. We show that recent mixing events in divergent populations limit the range of 
correlations and compare our findings to empirical results on the correlation of gene 
histories in the human genome. 
1. Introduction

Populations are shaped by demographic, historical and social factors, determining gene
histories in characteristic ways. Empirical data on genetic variation are now routinely 
interpreted using well-established gene-genealogical models [1-4] of the population in 
question. Local properties of genetic variation (pertaining to loci, short stretches 
of a chromosome) in such models are very well understood, by means of models of 
bottlenecks, population expansion [5-8], and migration [9-11]. By contrast, very little
is known about global patterns [12]. Global correlation and variation of patterns appear to
be the key to understanding the genetic factors contributing to common diseases: there is 
now a wealth of empirical information on the variation of genetic material in the human 
genome [13]. Many common diseases (such as cancer, obesity, cardiovascular disorder 
and diabetes) are caused by combinations of genetic and environmental factors [4]. In 
some cases a common variant of a single gene is responsible for specific syndromes. In 
more complex diseases, however, it may not be possible to link a disease to a single 
genetic factor. It is thus necessary to understand genome-wide association of genetic 
factors. 

Mutations and linkage disequilibrium (explained and illustrated in figure 1)
determine the genetic history of a population, which in turn shapes the patterns of
genetic variation of interest in gene association studies [4,12]. The question is: how
strongly are the patterns at two different loci correlated? Reich et al. [3] estimate
the empirical association of polymorphism rates, as a function of the physical distance
between the loci on the same chromosome, from human population data (compensating
for variations in the mutation rate along the chromosome by comparing to the population
data from the great apes). Assuming a neutral model with uniform mutation rate, the
covariance of polymorphism rates is given by the covariance of the times to the most
recent common ancestor of the two loci (cf. figure 1). Kaplan and Hudson [14] (see
also [15]) analysed the association of polymorphism rates for short loci, within the
standard unstructured neutral model. This was further developed by Pluzhnikov and
Donnelly [16], who analysed optimal sample sizes for surveying genetic diversity. Hudson
[17] and McVean et al. [18] estimate the recombination rate likelihood from two-locus
sample statistics, based on simulations. Recombination rate likelihoods, conditional
on more than two sites, have also been estimated using Monte-Carlo methods [19-21].
Although statistically powerful, these methods are computationally very demanding.
Linkage disequilibrium is often assessed through summary statistics such as $r^2$ [22] or
$D'$ [5]. McVean [23] introduced an approximation of the expected value of $r^2$, and
showed that the approximation is accurate, in the absence of demographic structure, if
the expectations are taken conditional on intermediate allelic frequencies.

In this paper, we derive analytical expressions for the correlation of genetic histories
in established models of demographic history (see figure 2a-c) in the limit of negligible
selection. These results are of interest for several reasons. First, as explained in the
following, they enable us to gain a qualitative understanding of the relative importance
of different biological factors determining the empirically observed patterns of linkage
disequilibrium. Second, the analytical results summarised in this article can be easily
generalised as explained below (see figure 2d,e). Third, our analytical expressions for
the decorrelation of gene histories allow for studying the implications of variations of
the recombination rate along the chromosomes [24,25]. The remainder of this paper is
organised into five parts. We begin by discussing gene-history correlations and linkage
disequilibrium in section 2 (see also figure 1). In section 3 we describe our method. We
summarise our results in section 4 and discuss their implications in section 5. In section
6 we draw conclusions. Two appendices summarise details of our calculations.

[Figure 1 about here.] 

[Figure 2 about here.] 
2. Gene-history correlations, linkage disequilibrium, and patterns of
genetic variation 

Genetic variation is caused by multiple factors. Together, mutations and recombination 
(figure 1) are the most important determinants of the large-scale haplotype structure in
the human genome [3,4,12]. The genetic history of nearby sites is closely related, while 
distant sites may become unrelated only a few generations in the past. 

Correlation of gene histories determines the degree of association between patterns
of genetic variation at different loci. An example is the correlation of the counts of
single-nucleotide polymorphisms (SNPs) at different loci: let $S_x(ij)$ be the number of
SNPs at locus $x$ between a pair of chromosomes $i$ and $j$. Further, let $\tau_x(ij)$ denote the
time to the most recent common ancestor of a locus at position $x$ on chromosomes $i$
and $j$, and define $\tau_y(ij)$ correspondingly for the locus at position $y$. Then the sample
covariance of the number of SNPs in non-overlapping loci $x$ and $y$ is related to the
covariance of the times $\tau_x(ij)$ and $\tau_y(ij)$ as follows:

$$\mathrm{cov}[S_x(ij), S_y(ij)] = (2\mu L)^2\, \mathrm{cov}[\tau_x(ij), \tau_y(ij)] . \qquad (1)$$

Here $L$ is the size of the loci, assuming variations in the mutation rate $\mu$ along the
chromosome are negligible. For (1) to hold, $L$ must be small enough that the sites
within each locus have a high degree of linkage (in humans, $L$ must be of the order of
or smaller than a few hundred base-pairs).

Associations between SNPs in the genetic mosaic allow for efficient mapping of
genes. Suitably chosen, a relatively small set of SNPs can capture most of the common 
patterns of variation in the genome [4]. 

The decay of the covariance $\mathrm{cov}[\tau_x(ij), \tau_y(ij)]$ as a function of $|x - y|$ measures linkage
disequilibrium. In the remainder of this section we briefly comment on other common
measures of linkage disequilibrium. Global association between patterns of diversity,
quantified by the extent of linkage disequilibrium, is often measured by Tajima's $D'$ [5]
or alternatively by

$$r^2 = \frac{D^2}{f_{A(x)}\,(1 - f_{A(x)})\, f_{B(y)}\,(1 - f_{B(y)})} , \qquad (2)$$

where $D = f_{A(x)B(y)} - f_{A(x)} f_{B(y)}$, $A(x)$ and $B(y)$ are the allelic types at the loci $x$ and $y$,
respectively, and $f_{A(x)B(y)}$ is the frequency of alleles $A(x)$ and $B(y)$ on the same chromosome
in the sample [5]. McVean [23] introduced an approximation to the expected value of $r^2$,
called $\sigma_d^2$, which makes the connection to the correlation of gene history explicit. With
the notation $E_{ij,kl} = \langle \tau_x(ij)\, \tau_y(kl) \rangle$,

$$\sigma_d^2 = \frac{(n^2 - 2n + 2)\, E_{ij,ij} - 2 (n-2)^2\, E_{ij,ik} + (n-2)(n-3)\, E_{ij,kl}}{2 E_{ij,ij} + 4 (n-2)\, E_{ij,ik} + (n-2)(n-3)\, E_{ij,kl}} . \qquad (3)$$

The factors $E_{ij,ij}$ and $E_{ij,ik}$ are defined analogously. For unstructured populations, $\sigma_d^2$
and the expected value of $r^2$ are approximately equal under the neutral dynamics, if the
expectation is conditioned on intermediate allelic frequencies [23].
3. Methods

In the following we analyse how correlation of gene histories depends on demographical
factors. In a large, unstructured population with constant population size, and when 
selection is negligible, the ancestral history of a locus may be modeled as a Markov 
process [2,26,27], where the states of the process correspond to different configurations 
of ancestral DNA through the history of the sample. 

We trace the ancestral history of two loci (at positions x and y) in n individuals, 
from the present back in time until the most recent common ancestor has been found 
for all loci. When the population size is large, the genealogical process may be 
approximated by the so-called coalescent process [1]: recombination is modeled as a 
Poisson process with rate r per generation per chromosome: for any given chromosome, 
with probability r (also known as the recombination fraction) the loci stem from different 
parents. The probability that one pair of individuals has a common ancestor in the 
preceding generation, and the probability that an individual inherits genetic material 
from both parents, are expanded in $N^{-1}$ to the first order. Time is measured in units
of $2N$ generations. In the limit of large $N$, the time to the next event is approximately
exponentially distributed [1]. 

By explicitly taking into account the symmetries of the state space of the coalescent
for two individuals, we obtain a compact representation of the Markov process (figure 3)
which allows us to derive and understand gene-history correlations in the models
mentioned in the introduction.

We illustrate our approach by re-deriving Hudson's result for the correlation of gene
histories in the unstructured, constant population-size coalescent model [15]. Consider a
sample of two individuals. Figure 3 shows a representation of the coalescent for this case.
Each node in the graph corresponds to a configuration of ancestral DNA (listed in the
table in figure 3). Due to the symmetries of the coalescent, many different configurations
may be mapped onto the same node.



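The time evolution of the probability distribution $P_i(t)$ over the states $i$ is given by
the master equation

$$\frac{dP_i}{dt} = \sum_{j \neq i} \left[ w_{j\to i}\, P_j(t) - w_{i\to j}\, P_i(t) \right] , \qquad (4)$$

where $w_{i\to j}$ is the transition rate from state $i$ to state $j$, given in figure 3. As above,
time is measured in units of $2N$ generations. The process is started in state 1, and
proceeds until it comes to state 5. We find that $\langle \tau_x(ij)\, \tau_y(ij) \rangle$ is given by the exit rates to
state 5, via states 1 and 4. Let $\tau_1$ be the first time at which a locus coalesces, and $\tau_2$ be
the time when both loci have coalesced. Since $\tau_x(ij)\, \tau_y(ij) = \tau_1 \tau_2$ we obtain

$$\langle \tau_1 \tau_2 \rangle = \int_0^\infty \! d\tau_1\, \tau_1^2\, u_1^T e^{M \tau_1} v
 + \int_0^\infty \! d\tau_1\, \tau_1\, u_2^T e^{M \tau_1} v \int_{\tau_1}^\infty \! d\tau_2\, \tau_2\, e^{-(\tau_2 - \tau_1)}
 = 2\, u_1^T (-M)^{-3} v + u_2^T (-M)^{-3} (2I - M)\, v , \qquad (5)$$

where $v = u_1 = (1, 0, 0)^T$, $u_2 = (0, 2, 2)^T$ and $M$ is a three-by-three matrix defined by
$M_{ij} = w_{j\to i}$ for $i, j = 1, \ldots, 3$ and $i \neq j$, and $M_{ii} = -\sum_j w_{i\to j}$. Evaluating (5) we
obtain the well-known result [15,27]

$$\rho(\tau_x(ij), \tau_y(ij)) = \frac{18 + R}{18 + 13R + R^2} , \qquad (6)$$

where $R = 4Nr$. In order to calculate $\sigma_d^2$ for the unstructured model, we obtain
$\langle \tau_x(ij)\, \tau_y(ik) \rangle$ and $\langle \tau_x(ij)\, \tau_y(kl) \rangle$ from (5) with $v = (0, 1, 0)^T$ and $v = (0, 0, 1)^T$, respectively.
Inserting these into eq. (3) we recover the result of McVean [23]:

$$\sigma_d^2 = \frac{2(6 + R) + n (10 + 11R + R^2) + n^2 (10 + R)}{2(6 + R) - n (14 + 13R + R^2) + n^2 (22 + 13R + R^2)} . \qquad (7)$$

[Figure 3 about here.]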

In the following, we consider models corresponding to Markov processes with rates
which are piece-wise constant functions of time $t$. This allows us to calculate $\langle \tau_x(ij)\, \tau_y(ij) \rangle$
from (5) by taking $M$ and $u$ to be functions of time.
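The re-derivation of Hudson's and McVean's results can be checked with exact rational arithmetic. The sketch below is illustrative only and rests on assumptions that are not spelled out in the text: the transition rates are taken from the matrix (B.10) with $\gamma = 1$, and the exit-rate integrals are written in the closed form $2 u_1^T(-M)^{-3}v + u_2^T(-M)^{-3}(2I - M)v$. The helper names (`solve`, `mean_tau_product`, `sigma_d2`, `sigma_d2_closed`) are hypothetical.

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination."""
    n = len(b)
    aug = [[Fraction(a) for a in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for c in range(n):
        piv = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[piv] = aug[piv], aug[c]
        d = aug[c][c]
        aug[c] = [a / d for a in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                f = aug[r][c]
                aug[r] = [a - f * p for a, p in zip(aug[r], aug[c])]
    return [row[n] for row in aug]

def mean_tau_product(R, v):
    """<tau_x tau_y> for a start distribution v over states 1-3 (time in units of 2N)."""
    R = Fraction(R)
    # -M for the two-locus coalescent of a constant-size population (assumed rates)
    neg_M = [[1 + R, -1, 0],
             [-R, 3 + R / 2, -4],
             [0, -R / 2, 6]]
    y1 = solve(neg_M, v)    # (-M)^-1 v
    y2 = solve(neg_M, y1)   # (-M)^-2 v
    y3 = solve(neg_M, y2)   # (-M)^-3 v
    u1, u2 = [1, 0, 0], [0, 2, 2]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    # 2 u1^T (-M)^-3 v + u2^T (-M)^-3 (2I - M) v
    return 2 * dot(u1, y3) + dot(u2, [2 * a + b for a, b in zip(y3, y2)])

def sigma_d2(n, E_ijij, E_ijik, E_ijkl):
    """Finite-sample sigma_d^2 as a ratio of two-locus expectations, eq. (3)."""
    num = (n * n - 2 * n + 2) * E_ijij - 2 * (n - 2) ** 2 * E_ijik + (n - 2) * (n - 3) * E_ijkl
    den = 2 * E_ijij + 4 * (n - 2) * E_ijik + (n - 2) * (n - 3) * E_ijkl
    return num / den

def sigma_d2_closed(n, R):
    """Closed form for the unstructured model, eq. (7)."""
    R = Fraction(R)
    num = 2 * (6 + R) + n * (10 + 11 * R + R * R) + n * n * (10 + R)
    den = 2 * (6 + R) - n * (14 + 13 * R + R * R) + n * n * (22 + 13 * R + R * R)
    return num / den

R, n = 5, 20
E = [mean_tau_product(R, v) for v in ([1, 0, 0], [0, 1, 0], [0, 0, 1])]
print(E[0] - 1, Fraction(18 + R, 18 + 13 * R + R * R))   # correlation vs. Hudson's formula
print(sigma_d2(n, *E), sigma_d2_closed(n, R))            # eq. (3) vs. eq. (7)
```

Because time is measured in units of $2N$ generations, the single-locus mean and variance are both 1, so the correlation is simply $\langle \tau_x \tau_y \rangle - 1$.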
4. Results



After having illustrated our approach, we now briefly describe the demographic models 
we have considered and summarise our results for gene-history correlations in these 
models. Mathematical details are given in appendices A and B. Implications are 
discussed in section 5. 



4.1. Bottleneck model 

Consider (cf. [28]) an unstructured population of constant size $N$ until $t_0 = 2NG$
generations ago. The population was then subject to a severe bottleneck of short
duration, followed by a rapid expansion to a very large (infinite) population size
(figure 2a). Between the bottleneck and now, the population size is taken to be effectively
infinite, and thus the probability that two randomly sampled individuals have a common
ancestor more recent than the bottleneck is negligible. Since the bottleneck is very narrow
and has a short duration, we may ignore the effect of recombination during the bottleneck. It
is convenient to parameterise the duration of the bottleneck in terms of the probability
$F$ that a single locus coalesces during the bottleneck. In the limit when both the
population size and duration of the bottleneck are small (compared to $2N$ individuals
and generations, respectively), we obtain (appendix A):

$$\rho(\tau_x(ij), \tau_y(ij)) = \frac{A + B\, e^{-GR/2} + C\, e^{-GR}}{15\, (2 - h)\, (18 + 13R + R^2)} , \qquad (8)$$

where $h = 1 - F$ and

$$A = 6 (36 - 45h + 20h^2 - h^5) + 3 (28 - 65h + 40h^2 - 3h^5) R + (1 - h)^3 (6 + 3h + h^2) R^2 , \qquad (9)$$

$$B = 12 (9 - 5h^2 + h^5) + (3 - 5h^2 + 2h^5) R^2 + 6 (7 - 10h^2 + 3h^5) R , \qquad (10)$$

$$C = 6 (36 - 10h^2 - h^5) + (6 - 5h^2 - h^5) R^2 + 3 (28 - 20h^2 - 3h^5) R . \qquad (11)$$

We thus find that this model exhibits correlations at arbitrarily large values of
$R$, a consequence of an infinite expansion rate after the bottleneck, and negligible
recombination within it. If, instead, the expansion were to a finite population size
(smaller than $GN$, say), the correlations would still converge to a constant at large $R$.
The constant, however, is expected to be lower than the asymptotic value obtained from
(8) as $R \to \infty$. Finally, if the bottleneck lasts long enough for significant recombination
to occur within it, we still find long-range correlations, up to scales of the order of
$(2 r D_r)^{-1}$, where $D_r$ is the duration of the bottleneck (in generations). Beyond this,
the correlations decay, and in the limit $R \to \infty$ we have $\rho(\tau_x(ij), \tau_y(ij)) \to 0$ as in the
unstructured population model.
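The closed form (8)-(11) is easy to evaluate numerically, which also makes its limits explicit. The following is a minimal sketch, not the authors' code; the function name `rho_bottleneck` is hypothetical, and the three limits checked at the end (complete linkage, no bottleneck, large $R$) are consequences worked out here from (8)-(11) rather than statements quoted from the text.

```python
import math

def rho_bottleneck(R, G, F):
    """Gene-history correlation in the bottleneck model, eqs. (8)-(11).

    R = 4Nr, G = t0 / (2N), F = probability that a single locus
    coalesces during the bottleneck.
    """
    h = 1.0 - F
    A = (6 * (36 - 45 * h + 20 * h**2 - h**5)
         + 3 * (28 - 65 * h + 40 * h**2 - 3 * h**5) * R
         + (1 - h)**3 * (6 + 3 * h + h**2) * R**2)
    B = (12 * (9 - 5 * h**2 + h**5)
         + 6 * (7 - 10 * h**2 + 3 * h**5) * R
         + (3 - 5 * h**2 + 2 * h**5) * R**2)
    C = (6 * (36 - 10 * h**2 - h**5)
         + 3 * (28 - 20 * h**2 - 3 * h**5) * R
         + (6 - 5 * h**2 - h**5) * R**2)
    Q = 18 + 13 * R + R**2
    return (A + B * math.exp(-G * R / 2) + C * math.exp(-G * R)) / (15 * (2 - h) * Q)

def rho_bottleneck_large_R(F):
    """Plateau reached as R -> infinity (from the R^2 coefficient of A)."""
    h = 1.0 - F
    return (1 - h)**3 * (6 + 3 * h + h**2) / (15 * (2 - h))

for R in (0.0, 1.0, 10.0, 100.0, 1e4):
    print(R, rho_bottleneck(R, G=0.5, F=0.4))
print("plateau:", rho_bottleneck_large_R(0.4))
```

The plateau depends only on $F$: recombination between the sample and the bottleneck cannot remove the correlation created by coalescence inside the bottleneck.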
By the same approach, we calculate $\langle \tau_x(ij)\, \tau_y(ik) \rangle$ and $\langle \tau_x(ij)\, \tau_y(kl) \rangle$. Inserting this into
(3) yields, for large $n$:

$$\sigma_d^2 = \frac{\left[ 18 h (36 - 10h^2 - h^5) + 9 h (28 - 20h^2 - 3h^5) R + 3 h (6 - 5h^2 - h^5) R^2 \right] e^{-GR}}{45\, (18 + 13R + R^2)\, \langle \tau_x(ij)\, \tau_y(kl) \rangle} , \qquad (12)$$

where

$$45\, (18 + 13R + R^2)\, \langle \tau_x(ij)\, \tau_y(kl) \rangle = 18 (45 G^2 + 36 h + 90 G h + 20 h^3 - h^6)
 + 9 (65 G^2 + 28 h + 130 G h + 40 h^3 - 3 h^6) R
 + (45 G^2 + 18 h + 90 G h + 30 h^3 - 3 h^6) R^2 . \qquad (13)$$

Note that $\sigma_d^2 \to 0$ as $R \to \infty$. The difference, in particular, to expression (7) is not
large. Hence, when the aim is to detect the population-size variations it is better to
focus on single-locus statistics.

4.2. Model of divergent populations, I

Reich et al. consider a model of a diverging population [3]: the population was
unstructured with constant population size until $t_0 = 2NG$ generations ago, when
the population split into two parts of equal size (note that this implies a rapid
population expansion from $N/2$ to $N$ after the split). The model is illustrated in
figure 2c. A portion $p$ of the sample is chosen from the first population, and the rest
from the second population. For any two individuals in the sample, the expectation
$\langle \tau_x(ij)\, \tau_y(ij) \rangle$ depends on whether the individuals come from the same sub-population
or not. Using the technique illustrated above, it is straightforward to calculate the
expectation for both cases. Again, we find long-range correlations, namely

$$\rho(\tau_x(ij), \tau_y(ij)) = 1 - \frac{1}{1 + 2p(1-p)\,(1 - 2p + 2p^2)\, G^2} \qquad (14)$$

in the limit of large $R$ (in appendix B we describe how to obtain the full result, valid
for arbitrary values of $R$).

Further, in the limit of large $R$ and large sample size $n$, we have

$$\sigma_d^2 = \left[ \frac{2p(1-p)\, G}{1 + 2p(1-p)\, G} \right]^2 . \qquad (15)$$

Thus, for this model $\sigma_d^2$ is finite in the limit of large $R$, as opposed to in the
unstructured model (section 3) and the bottleneck model (section 4.1).
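The large-$R$ limits of this model follow from a short argument: at large $R$ the two loci are independent given where the two individuals were sampled, so all remaining correlation comes from the difference in mean coalescence time between same-population pairs (mean 1) and different-population pairs (mean $1 + G$). The sketch below evaluates the two limits under that reading; it is an illustration, not the authors' derivation, and the function names are hypothetical.

```python
def rho_large_R_model1(p, G):
    """Large-R limit of the gene-history correlation, eq. (14)."""
    q = 2 * p * (1 - p)           # probability that a pair spans both sub-populations
    # q * (1 - q) equals 2p(1-p)(1 - 2p + 2p^2)
    return 1 - 1 / (1 + q * (1 - q) * G**2)

def sigma_d2_large_R_model1(p, G):
    """Large-R, large-n limit of sigma_d^2 from eq. (3) with the same pair means."""
    q = 2 * p * (1 - p)
    return (q * G / (1 + q * G)) ** 2

for p in (0.1, 0.3, 0.5):
    print(p, rho_large_R_model1(p, G=1.0), sigma_d2_large_R_model1(p, G=1.0))
```

Both quantities vanish when the whole sample comes from one sub-population ($p = 0$ or $p = 1$), and both grow with the divergence time $G$.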



4.3. Model of divergent populations, II

Now consider the model of two diverging sub-populations [28] in figure 2b. The
population was unstructured with constant size of $N$ individuals until $t_0 = 2NG$
generations ago, when a fraction $\gamma$ of the population diverged. In subsequent
generations, the two sub-populations were unstructured but with no contact between
sub-populations. Individuals are randomly chosen from the joint population. For two
individuals in the sample, there are three cases: both individuals may come from the
smaller sub-population, they may come from the larger sub-population, or from different
sub-populations. Using equation (5) we find long-range correlations: in the limit of large
$R$, $\rho$ remains finite,

$$\rho(\tau_x(ij), \tau_y(ij)) = \frac{1}{\mathrm{var}[\tau]} \Big[ 1 - 2s + 2s^2 + 2G(2 + G)s + s^2 e^{-2G/\gamma} + s^2 e^{-2G/(1-\gamma)}
 + 2s(1-\gamma)^2 e^{-G/(1-\gamma)} + 2s\gamma^2 e^{-G/\gamma} - \langle \tau \rangle^2 \Big] , \qquad (16)$$

where $s = \gamma (1 - \gamma)$ and

$$\langle \tau \rangle = 1 + s(2G - 1) + s\gamma\, e^{-G/\gamma} + s(1-\gamma)\, e^{-G/(1-\gamma)} , \qquad (17)$$

$$\mathrm{var}[\tau] = 2 + 2s \left[ 2s + (G + 1)^2 + \gamma (1 + G + \gamma)\, e^{-G/\gamma}
 + (1 - \gamma)(2 + G - \gamma)\, e^{-G/(1-\gamma)} - 3 \right] - \langle \tau \rangle^2 . \qquad (18)$$

See appendix B for the full result. The long-range correlations are found to be due to
sampling of different sub-populations.

In the limit of large $R$ and large sample size, we have

$$\sigma_d^2 = \frac{\gamma^2 (1-\gamma)^2}{\langle \tau \rangle^2} \left[ 2G + \gamma \left( 1 - e^{-G/(1-\gamma)} \right) + (1-\gamma) \left( 1 - e^{-G/\gamma} \right) \right]^2 . \qquad (19)$$

Again, we find that $\sigma_d^2$ is finite in the limit of large $R$.
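The large-$R$ expressions for this model can be evaluated directly. The sketch below is a hedged illustration: it codes (16)-(19) as read here, with $e^{-G/\gamma}$ and $e^{-G/(1-\gamma)}$ being the probabilities that a pair drawn from the smaller or the larger sub-population has not coalesced before the split. The function name `model2_large_R` is hypothetical.

```python
import math

def model2_large_R(gamma, G):
    """Return (<tau>, var[tau], rho, sigma_d^2) in the large-R limit, eqs. (16)-(19)."""
    s = gamma * (1 - gamma)
    ea = math.exp(-G / gamma)          # no coalescence within the fraction-gamma part
    eb = math.exp(-G / (1 - gamma))    # no coalescence within the remaining part
    mean = 1 + s * (2 * G - 1) + s * gamma * ea + s * (1 - gamma) * eb
    second_moment = 2 + 2 * s * (2 * s + (G + 1)**2 + gamma * (1 + G + gamma) * ea
                                 + (1 - gamma) * (2 + G - gamma) * eb - 3)
    var = second_moment - mean**2
    cross = (1 - 2 * s + 2 * s**2 + 2 * G * (2 + G) * s
             + s**2 * ea**2 + s**2 * eb**2
             + 2 * s * (1 - gamma)**2 * eb + 2 * s * gamma**2 * ea)
    rho = (cross - mean**2) / var
    sigma_d2 = (s * (2 * G + gamma * (1 - eb) + (1 - gamma) * (1 - ea)) / mean) ** 2
    return mean, var, rho, sigma_d2

for G in (0.0, 0.1, 0.5, 2.0):
    print(G, model2_large_R(0.2, G))
```

With no divergence ($G = 0$) the function returns mean 1, variance 1 and vanishing correlation, i.e. the large-$R$ behaviour of the unstructured model.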
5. Discussion

Figure 4 shows the correlations $\rho(\tau_x(ij), \tau_y(ij))$ in the demographic models considered,
with parameters chosen to be consistent with the empirically estimated time to the most
recent common ancestor and its coefficient of variation [3]. When plotting the correlation
of gene histories against physical positions, we need to translate the recombination
fraction $r$ into the corresponding expected number $\sigma x$ of crossover events between the
two loci. There are many such maps proposed in the literature (see e.g. [29] for a review
of these). They differ in how they model the chiasma process, but all models have in
common that for small enough $r$, $r \approx \sigma x$. In humans, $r \approx \sigma x$ for $x < 10^{\ldots}$ bp. At
larger distances, deviations from linearity are not noticeable since the expressions for
$\rho(\tau_x(ij), \tau_y(ij))$ and $\sigma_d^2$ converge for large $R$ (to different values, in general). Also shown
are empirical estimates of lower and upper bounds on the correlation of gene histories
in the human genome [3]. The correlations for the models described in section 4 are
substantially larger at large distances than those for the unstructured model, but they
lie significantly below the lower bound of the empirical data, at intermediate distances.
We comment on possible causes for this discrepancy in our conclusions.

[Figure 4 about here.] 

Our results allow us to gain a qualitative understanding of the influence of
demographic factors on the decorrelation of gene histories. First, we find that models
of bottlenecks and divergent populations (figure 2) both exhibit long-range correlations
in gene histories, as numerically demonstrated in [3], but for very different reasons. In
bottlenecks, the length scale at which we find significant correlations is governed by
the degree of recombination within the bottleneck: low recombination in the bottleneck
gives rise to long-range correlations. Further, the amount of correlation is affected by
the rate of expansion of the population after the bottleneck: rapid expansion gives high
correlations. Long-range correlation in divergent models, on the other hand, we ascribe to
the fact that the covariance of $\tau_x(ij)$ and $\tau_y(ij)$ (that is, the number of generations since
the common ancestor of two copies of loci $x$ and $y$) is different when individuals are
selected from the same or different sub-populations: typically, the covariance is lower
for individuals from the same sub-population than from different ones. We find that
this effect persists even for loci far apart, but is decreased by population expansions
during the divergence.

Second, we identify two contributions to the correlation of gene histories in divergent 
populations: linkage disequilibrium and the sampling of sub-populations with different 
demographic histories. At short ranges, linkage disequilibrium correlates nearby 
patterns by co-inheritance. Thus, for small distances, we conclude that the demographic 
structure is unimportant: all reasonable models must give high correlation for small 
distances. For long ranges, by contrast, correlations due to linkage disequilibrium are 
expected to vanish, but the contribution from differences in gene history across sub- 
populations remains. 
Third, the domestication of crops and animals has shaped the genetic makeup of
the species, through selection for desirable traits but also through the demographic 
history of each species [28]. The pattern of genetic differences in the laboratory mouse 
population depends strongly on its demographic history [30]. In divergent populations, 
we find that long-range correlations are insensitive to the demographic history of the 
sub-populations. As a consequence, we predict that the most important contribution to 
the correlation of gene history in the laboratory mouse is from the original divergence 
from the wild-type mouse. 

Fourth, we found that within the models described in section 4, gene-history
correlations are substantially increased as compared with the unstructured, standard 
model. However, the correlations still lie significantly below the empirically determined 
data at intermediate distances. In [25] it was shown that incorporating empirically 
observed variations in the recombination-rate along the chromosomes [24] significantly 
increases the correlations in this regime. Our analytical expressions for the correlation 
of gene histories allow for studying the effect of such variations in the recombination 
rate in models with demographic population structure. 

Fifth, we briefly mention possible extensions of the scheme introduced in this paper.
In more general sampling schemes (different from those depicted in figure 2), we may
use the expressions for $\langle \tau_x(ij)\, \tau_y(ij) \rangle$ conditional on whether the individuals in the sample
came from the same sub-population or not, and conditional on the population size during
the divergence, to calculate the correlation of gene histories by weighting the different
contributions by the probability that they occur under the sampling scheme. Also,
it is straightforward to extend the calculations to combinations of bottlenecks and
divergent populations (figure 2d), and to more complicated models involving more than
two diverging branches (figure 2e). It is expected that the most distant (symmetric)
divergence determines the long-range correlations.

How would a recent mixing event (figure 2e) affect the correlation of gene histories?
A merging of the divergent populations $g$ generations ago leads to a decorrelation of
gene histories at distances of the order of $(4gr)^{-1}$, since then ancestral lines of both loci
may come from different sub-populations with approximately equal probability.
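To get a feeling for the length scale set by a mixing event, the estimate $(4gr)^{-1}$ can be turned into a physical distance by using the linear map $r \approx \sigma x$ discussed earlier in this section. The sketch below does this under a loudly stated assumption: the crossover rate `sigma_per_bp = 1e-8` per base pair per generation (roughly 1 cM per Mb) is an illustrative value chosen here, not a number taken from the text, and the function name is hypothetical.

```python
def decorrelation_distance_bp(g, sigma_per_bp=1e-8):
    """Distance x (in bp) at which 4 * g * r is of order one, with r = sigma * x.

    g            : number of generations since the populations merged
    sigma_per_bp : assumed crossover rate per bp per generation (illustrative)
    """
    return 1.0 / (4.0 * g * sigma_per_bp)

for g in (10, 100, 1000):
    print(g, decorrelation_distance_bp(g))
```

Under this assumption a merger 100 generations ago limits the correlations to distances of a few hundred kilobases, while an older merger limits them to correspondingly shorter distances.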

Finally, we have argued that the correlation $\rho(\tau_x(ij), \tau_y(ij))$ of gene histories
determines the association of SNP counts, $\mathrm{cov}[S_x(ij), S_y(ij)]$. Conversely one may be
interested in estimating model parameters from population data, deducing $\rho(\tau_x(ij), \tau_y(ij))$
from the pairwise statistic $\mathrm{cov}[S_x(ij), S_y(ij)]$. Three questions arise. First, how can one in
practice estimate $\mathrm{cov}[\tau_x(ij), \tau_y(ij)]$ from the variance of SNP counts? Second, how good is
this estimate? Third, how much of the information in the full data set (possibly pertaining
to a large number of individuals) is retained in the pairwise statistic $\mathrm{cov}[S_x(ij), S_y(ij)]$?
We begin by answering the last question. Due to the high amount of association between
the chromosomes in a sample, the information on genealogical history accumulates
slowly as the sample size is increased [17]. It follows that most information can be
found in pairwise comparisons between the chromosomes in the sample as used in
eq. (1). Going back to the first two questions, an estimator for $\rho(\tau_y(ij), \tau_{y+x}(ij))$ can be
constructed as follows. Assuming that the sequences are long, we can
estimate the correlation of polymorphism rates by averaging over all pairs and positions:

$$\rho(\tau_y(ij), \tau_{y+x}(ij)) \approx \hat\rho(x) = \frac{\overline{S_y S_{y+x}} - \overline{S_y}^{\,2}}{\overline{S_y^2} - \overline{S_y}^{\,2} - \overline{S_y}} , \qquad (20)$$

where

$$\overline{S_y S_{y+x}} = \binom{n}{2}^{-1} \frac{1}{L_c - x - L} \sum_{i=2}^{n} \sum_{j=1}^{i-1} \sum_{y=1}^{L_c - x - L} S_y(ij)\, S_{y+x}(ij) , \qquad (21)$$

and the single-locus quantities $\overline{S_y}$ and $\overline{S_y^2}$ are defined similarly. Instead of regularly
spaced bins, as in (21), one may use randomly positioned bins. For unstructured
populations, and for populations with bottlenecks and expansions, the accuracy of the
estimator $\hat\rho(x)$ depends mostly on the number of bins (and hence on $L_c$), and improves
only slowly with increasing $n$. For divergent models, however, increasing $n$ improves the
sampling from the different sub-populations. In figure 5 we show how $\hat\rho(x)$ compares to
$\rho(\tau_y(ij), \tau_{y+x}(ij))$ when applied to a sample. As can be seen in the figure, when $x < L$
the bins overlap and $\hat\rho(x)$ overestimates the correlations, but otherwise it works well.
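A pairwise estimator of this kind can be sketched in a few lines. The version below is an assumption-laden illustration rather than the authors' procedure: it assumes mutations are Poisson given the genealogy, so that the variance of the SNP count in a bin equals its mean plus $(2\mu L)^2\, \mathrm{var}[\tau]$, which is why the mean count is subtracted in the denominator. The input layout (`counts[k][y]`, the SNP count in bin `y` for chromosome pair `k`) and the name `rho_hat` are hypothetical.

```python
def rho_hat(counts, lag):
    """Estimate the gene-history correlation between bins `lag` apart.

    counts : list of rows, one per chromosome pair; counts[k][y] is the number
             of SNPs in bin y between the two chromosomes of pair k
    lag    : separation between the two bins, in units of bins (lag >= 1)
    """
    # products S_y * S_(y+lag), averaged over all pairs and positions
    products = [row[y] * row[y + lag] for row in counts for y in range(len(row) - lag)]
    mean_cross = sum(products) / len(products)
    # single-locus averages over all pairs and bins
    flat = [s for row in counts for s in row]
    mean_s = sum(flat) / len(flat)
    mean_s2 = sum(s * s for s in flat) / len(flat)
    # subtracting mean_s removes the Poisson (mutational) part of the variance
    return (mean_cross - mean_s**2) / (mean_s2 - mean_s**2 - mean_s)

toy_counts = [[0, 2, 1, 3],
              [4, 6, 5, 7]]
print(rho_hat(toy_counts, lag=1))
```

On small inputs such as the toy data the estimate is noisy and need not lie between 0 and 1; as noted in the text, the accuracy is governed mainly by the number of bins.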



[Figure 5 about here.] 
6. Conclusions and outlook

We have derived closed analytical expressions for the correlation of gene histories in 
established demographic models for genetic evolution. These expressions allow us to 
understand and quantitatively determine how demographical factors give rise to long- 
range correlations in gene histories. 

The correlations analysed here determine the two-person summary statistic (1).
More information is contained in the mosaics of SNP haplotype patterns for more than 
two individuals, and their associations [17]. It is of great interest to derive corresponding 
expressions for correlations between such patterns in the models considered in this paper, 
especially in the case of more than two loci. Finally we note that the quantity $\sigma_d^2$, a
measure of linkage disequilibrium, was shown to be a good approximation to the expected
value of $r^2$ in the case of unstructured populations [18]. It is necessary to investigate the
relation between $\sigma_d^2$ and $r^2$ in models with demographic structure.
Appendix A: Derivation of bottleneck formula

During the bottleneck, the time between coalescent events is exponentially distributed
with rate $\binom{n}{2} / (2 \gamma N)$, where $n$ is the number of lines carrying ancestral material.
Recombination events occur with rate $n R / (4N)$, independent of $\gamma$. Thus when $\gamma$
is very small, coalescent events dominate the process.

We assume that during the bottleneck, the reduction in effective population size is
so drastic that $\gamma$ is effectively zero. By rescaling the time by a factor of $\gamma$ and taking
the limit $\gamma \to 0$ we find

$$M' = \lim_{\gamma \to 0} M(\gamma)\, \gamma = \begin{pmatrix} -1 & 1 & 0 \\ 0 & -3 & 4 \\ 0 & 0 & -6 \end{pmatrix} , \qquad (A.1)$$

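so the time evolution operator becomes

$$\exp(M' t) = \begin{pmatrix}
e^{-t} & \tfrac{1}{2} e^{-t} - \tfrac{1}{2} e^{-3t} & \tfrac{2}{5} e^{-t} - \tfrac{2}{3} e^{-3t} + \tfrac{4}{15} e^{-6t} \\
0 & e^{-3t} & \tfrac{4}{3} e^{-3t} - \tfrac{4}{3} e^{-6t} \\
0 & 0 & e^{-6t}
\end{pmatrix} . \qquad (A.2)$$
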

In the original model, the inbreeding coefficient $F$ was specified. We choose to
parameterise the severity of the bottleneck by its duration $D$. If the process is in
state 1 (figure 3) when entering the bottleneck, the probability of coalescence during
the bottleneck is

$$\int_0^D u_1^T e^{M' t}\, u_1\, dt = 1 - e^{-D} , \qquad (A.3)$$

so we see that by taking $D = -\ln(1 - F)$, we get the correct inbreeding coefficient.
We can now express the time evolution operator from the beginning to the end of the
bottleneck as

$$\exp(M' D) = \begin{pmatrix}
H & \tfrac{1}{2} H (1 - H^2) & \tfrac{2}{15} H (3 - 5H^2 + 2H^5) \\
0 & H^3 & \tfrac{4}{3} H^3 (1 - H^3) \\
0 & 0 & H^6
\end{pmatrix} , \qquad (A.4)$$



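where $H = 1 - F$. The probability that the loci become linked during the bottleneck
depends on the state of the process when the bottleneck is entered:

$$\int_0^D u_1^T e^{M' t}\, v\, dt = \begin{cases}
F & \text{in state 1} \\
\tfrac{1}{6} (2 + H)\, F^2 & \text{in state 2} \\
\tfrac{2}{45} (5 + 6H + 3H^2 + H^3)\, F^3 & \text{in state 3}
\end{cases} \qquad (A.5)$$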

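Similarly, we have the probability that one locus, but not the other, reaches its most
recent common ancestor during the bottleneck, depending on the state of the process
when entering the bottleneck:

$$\int_0^D u_2^T e^{M' t}\, v\, dt = \begin{cases}
0 & \text{in state 1} \\
\tfrac{2}{3} (1 - H^3) & \text{in state 2} \\
\tfrac{1}{9} (7 - 8H^3 + H^6) & \text{in state 3}
\end{cases} \qquad (A.6)$$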
Together, (A.4), (A.5) and (A.6) determine the state of the process after the bottleneck.
Using this information and the method for the unstructured population as outlined in
section 3 allows us to derive the gene-history correlation for the bottleneck model.
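The ingredients of this appendix can be cross-checked numerically. The sketch below is illustrative and makes its assumptions explicit: the closed form of $\exp(M't)$ is the one read from (A.2), the probabilities in (A.5) and (A.6) are taken to be the time integrals of the exit rates $u_1 = (1,0,0)^T$ and $u_2 = (0,2,2)^T$ over the bottleneck, and the Taylor series and Simpson rule are used only as independent checks. All function names are hypothetical.

```python
import math

M_PRIME = [[-1.0, 1.0, 0.0], [0.0, -3.0, 4.0], [0.0, 0.0, -6.0]]

def exp_closed(t):
    """Closed form of exp(M' t), eq. (A.2)."""
    e1, e3, e6 = math.exp(-t), math.exp(-3 * t), math.exp(-6 * t)
    return [[e1, (e1 - e3) / 2, 2 * e1 / 5 - 2 * e3 / 3 + 4 * e6 / 15],
            [0.0, e3, 4 * (e3 - e6) / 3],
            [0.0, 0.0, e6]]

def exp_series(M, t, terms=60):
    """Taylor series of exp(M t); an independent check of the closed form."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[sum(term[i][l] * M[l][j] for l in range(n)) * t / k
                 for j in range(n)] for i in range(n)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def simpson(f, a, b, steps=2000):
    """Composite Simpson rule with an even number of steps."""
    h = (b - a) / steps
    total = f(a) + f(b)
    total += sum((4 if k % 2 else 2) * f(a + k * h) for k in range(1, steps))
    return total * h / 3

def p_linked(F):
    """Eq. (A.5): both loci coalesce together, for entry states 1, 2, 3."""
    H = 1 - F
    return [F, (2 + H) * F**2 / 6, 2 * (5 + 6 * H + 3 * H**2 + H**3) * F**3 / 45]

def p_one_locus(F):
    """Eq. (A.6): one locus coalesces first, for entry states 1, 2, 3."""
    H = 1 - F
    return [0.0, 2 * (1 - H**3) / 3, (7 - 8 * H**3 + H**6) / 9]

F = 0.35
D = -math.log(1 - F)          # duration that reproduces the inbreeding coefficient F
for s in range(3):
    linked = simpson(lambda t: exp_closed(t)[0][s], 0.0, D)
    single = simpson(lambda t: 2 * (exp_closed(t)[1][s] + exp_closed(t)[2][s]), 0.0, D)
    print(s + 1, linked, p_linked(F)[s], single, p_one_locus(F)[s])
```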



Appendix B: Correlation of gene histories in divergent populations 

Assume that individuals come from the left sub-population with probability $p$ and from
the right one with probability $1 - p$. The population sizes in the left and right
sub-populations are $\gamma N$ and $\Gamma N$, respectively, and the population size before the divergence
is $N$. The two-person coalescent process is described by a Markov process over the
states in table 1, where state 1 is the absorbing state of the process, and the process
starts in one of states 3-11.

[Table 1 about here.] 
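We now define $e_i = \langle \tau_1 \tau_2 \,|\, \text{process starting in state } i \rangle$. With these, we may write

$$\langle \tau_x(ij)\, \tau_y(ij) \rangle = p^2\, e_3(\gamma) + (1-p)^2\, e_3(\Gamma) + 2p(1-p)\, e_4(\gamma, \Gamma) , \qquad (B.7)$$

$$\begin{aligned}
\langle \tau_x(ij)\, \tau_y(ik) \rangle = {} & p^3\, e_5(\gamma) + (1-p)^3\, e_5(\Gamma) \\
& + 2p(1-p)^2\, e_6(\Gamma) + 2p^2(1-p)\, e_6(\gamma) \\
& + p(1-p)^2\, e_7(\gamma, \Gamma) + p^2(1-p)\, e_7(\Gamma, \gamma) ,
\end{aligned} \qquad (B.8)$$

$$\begin{aligned}
\langle \tau_x(ij)\, \tau_y(kl) \rangle = {} & p^4\, e_8(\gamma) + (1-p)^4\, e_8(\Gamma) \\
& + 4p^3(1-p)\, e_9(\gamma) + 4p(1-p)^3\, e_9(\Gamma) \\
& + 4p^2(1-p)^2\, e_{10}(\gamma, \Gamma) + 2p^2(1-p)^2\, e_{11}(\gamma, \Gamma) .
\end{aligned} \qquad (B.9)$$

From this, the correlation $\rho(\tau_x(ij), \tau_y(ij))$ and $\sigma_d^2$ may be calculated for both models of
divergent populations: setting $\gamma = \Gamma = 1$ gives the model described in section 4.2;
setting $\Gamma = 1 - \gamma$ and $p = \gamma$ gives the model described in section 4.3.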

Calculation of $e_3, \ldots, e_{11}$ for the model introduced in section 4.2

The two-locus coalescent in a population of size $\gamma N$ is described by a Markov process
with the evolution matrix

$$M_1 = \begin{pmatrix}
-1/\gamma - R & 1/\gamma & 0 \\
R & -3/\gamma - R/2 & 4/\gamma \\
0 & R/2 & -6/\gamma
\end{pmatrix} , \qquad (B.10)$$

where $R = 4Nr$. Before the divergence, $\gamma = 1$ and we denote the corresponding
evolution matrix $M$. Assuming that the population is in state 3, 5, or 8 with probabilities
$v_1$, $v_2$, and $v_3$, respectively, we proceed as for the unstructured population in section 3,
calculating $\langle \tau_1 \tau_2 \rangle$ conditional on starting from distribution $v$. We obtain
$e_3(\gamma) = C_s(\gamma, (1, 0, 0)^T)$, $e_5(\gamma) = C_s(\gamma, (0, 1, 0)^T)$, and $e_8(\gamma) = C_s(\gamma, (0, 0, 1)^T)$, where

$$\begin{aligned}
C_s(\gamma, v) = {} & \frac{1}{\gamma}\, u_1^T (-M_1)^{-3} \left[ 2I - (2I - 2G M_1 + G^2 M_1^2) \exp(M_1 G) \right] v \\
& + u_1^T (-M)^{-3} (2I - 2GM + G^2 M^2) \exp(M_1 G)\, v \\
& + \frac{1}{\gamma}\, u_2^T (-M_1)^{-3} \left\{ 2I - \gamma M_1 - \left[ 2I - (2G + \gamma) M_1 + G(G + \gamma) M_1^2 \right] \exp(M_1 G) \right\} v \\
& + (1 - \gamma)\, u_2^T (I + \gamma M_1)^{-2} \left\{ \gamma e^{-G/\gamma} I + \left[ (G - \gamma) I + \gamma G M_1 \right] \exp(M_1 G) \right\} v \\
& + u_2^T (-M)^{-3} \left[ 2I - (1 + 2G) M + G(G + 1) M^2 \right] \exp(M_1 G)\, v .
\end{aligned} \qquad (B.11)$$

During the split, the coalescent is described by a Markov process with the evolution
matrix

$$M_2 = \begin{pmatrix}
-1/\gamma - R/2 & 2/\gamma \\
R/2 & -3/\gamma
\end{pmatrix} . \qquad (B.12)$$

A coalescent event during the split happens with the distribution $\gamma^{-1} (1, 1)\, e^{M_2 \tau_1} v$, where
$v = (1, 0)^T$ when starting from state 6 and $v = (0, 1)^T$ when starting from state 9. Thus,
we have the contribution

$$\int_0^G \tau_1\, \frac{1}{\gamma}\, (1, 1)\, e^{M_2 \tau_1} v\, d\tau_1 \int_G^\infty \tau_2\, e^{-(\tau_2 - G)}\, d\tau_2 .$$
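The population is in state 5 or 8, right before the split, with probability 
$a \exp(M_2 G)\, v$, where $a = (1, 0)$ for state 5 and $a = (0, 1)$ for state 8. 
From this we obtain 

$$ e_6(\gamma) = A(\gamma) + R\gamma\, B(\gamma), \qquad e_9(\gamma) = A(\gamma) - 4 B(\gamma), $$

where 

$$ A(\gamma) = (1 + G)\gamma + \left[ (1 + G)(1 - \gamma) + \frac{24 + 4R\gamma}{(4 + R\gamma)(18 + 13R + R^2)} \right] \exp\left( -\frac{G}{\gamma} \right) \tag{B.13} $$

and 

$$ B(\gamma) = \frac{2}{(4 + R\gamma)(18 + 13R + R^2)} \exp\left( -\frac{G(6 + R\gamma)}{2\gamma} \right). \tag{B.14} $$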
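
The closed forms for $e_6$ and $e_9$ can be compared with a direct evaluation of the two contributions described above: coalescence of one locus during the split, and continuation from state 5 or 8 after it. The following sketch is an illustration under stated assumptions, not the authors' code. It assumes the $2 \times 2$ matrix $M_2$ of (B.12), the covariances $6/(R^2 + 13R + 18)$ and $4/(R^2 + 13R + 18)$ for states 5 and 8, and the coefficients $R\gamma$ and $-4$ in front of $B(\gamma)$. The function names are hypothetical.

```python
import numpy as np

def D(R):
    return R**2 + 13 * R + 18

def A(gamma, R, G):
    frac = (24 + 4 * R * gamma) / ((4 + R * gamma) * D(R))
    return (1 + G) * gamma + ((1 + G) * (1 - gamma) + frac) * np.exp(-G / gamma)

def B(gamma, R, G):
    return 2.0 / ((4 + R * gamma) * D(R)) * np.exp(-G * (6 + R * gamma) / (2 * gamma))

def e6(gamma, R, G):
    return A(gamma, R, G) + R * gamma * B(gamma, R, G)

def e9(gamma, R, G):
    return A(gamma, R, G) - 4 * B(gamma, R, G)

def e_direct(gamma, R, G, v):
    """Sum of the two contributions, evaluated with matrix operations only."""
    I = np.eye(2)
    M2 = np.array([[-1 / gamma - R / 2, 2 / gamma],
                   [R / 2, -3 / gamma]])
    w, V = np.linalg.eig(M2)
    E = ((V * np.exp(w * G)) @ np.linalg.inv(V)).real  # exp(M2 G)
    M2inv = np.linalg.inv(M2)
    # int_0^G t exp(M2 t) dt = M2^-2 [I + (G M2 - I) exp(M2 G)]
    integral = M2inv @ M2inv @ (I + (G * M2 - I) @ E)
    # One locus coalesces during the split, the other after it (mean G + 1).
    during = (1 + G) / gamma * np.ones(2) @ integral @ v
    # No coalescence before G: continue from state 5 or 8.
    p = E @ v
    after = (1 + G) ** 2 * p.sum() + (6 * p[0] + 4 * p[1]) / D(R)
    return during + after

if __name__ == "__main__":
    gamma, R, G = 0.4, 3.0, 0.6
    print(e6(gamma, R, G), e_direct(gamma, R, G, np.array([1.0, 0.0])))
    print(e9(gamma, R, G), e_direct(gamma, R, G, np.array([0.0, 1.0])))
```

At $G = 0$ the two expressions reduce to the values for states 5 and 8 in the merged population, which is a convenient check on the coefficients.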
Now consider starting from states 4, 7 or 10. In these cases, there is no coalescent 
event during the split. In each sub-population the coalescent is described by a Markov 
process with the evolution matrix 

$$ M_3 = \begin{pmatrix} -R/2 & 1/\gamma \\ R/2 & -1/\gamma \end{pmatrix}. \tag{B.15} $$

Note that the columns sum to zero: the probability of escaping from these states is zero 
during the split. 

Right before the split, the population is in state 3, 5 or 8 with probability $\phi_1$, 
$\phi_2$, and $\phi_3$, respectively. Then, the contribution is 

$$ (1 + G)^2 (\phi_1 + \phi_2 + \phi_3) + \frac{(R + 18)\phi_1 + 6\phi_2 + 4\phi_3}{R^2 + 13R + 18}. \tag{B.16} $$

Now define $P_L(\gamma)$ as the probability of the genetic material being on the same gamete 
at the moment of the split, given that it is on the same gamete in the sample. We have 

$$ P_L(\gamma) = (1, 0) \exp(M_3 G)\, (1, 0)^T = \frac{2 + R\gamma \exp\left( -\frac{G(2 + R\gamma)}{2\gamma} \right)}{2 + R\gamma}. \tag{B.17} $$

Similarly, we define $P_B(\gamma)$ as the probability of the genetic material being on the same 
gamete at the moment of the split, given that it is on different gametes in the sample. 
We have 

$$ P_B(\gamma) = (1, 0) \exp(M_3 G)\, (0, 1)^T = \frac{2 - 2 \exp\left( -\frac{G(2 + R\gamma)}{2\gamma} \right)}{2 + R\gamma}. \tag{B.18} $$

If the sample is in state 4, we have 

$$
\begin{aligned}
\phi_1 &= P_L(\gamma)\, P_L(\Gamma), \\
\phi_2 &= P_L(\gamma) \left[ 1 - P_L(\Gamma) \right] + \left[ 1 - P_L(\gamma) \right] P_L(\Gamma), \\
\phi_3 &= \left[ 1 - P_L(\gamma) \right] \left[ 1 - P_L(\Gamma) \right].
\end{aligned}
\tag{B.19}
$$
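
The probabilities $P_L$ and $P_B$ are the entries in the first row of $\exp(M_3 G)$, so their closed forms are easy to check numerically. The sketch below does this and then inserts (B.19) into the contribution (B.16) for a sample in state 4. It is an illustration under stated assumptions: $M_3$ has the $2 \times 2$ form of (B.15), the second sub-population uses the same matrix with $\gamma$ replaced by $\Gamma$, and the parameter values are arbitrary. The function names are hypothetical.

```python
import numpy as np

def expm(A, t):
    """exp(A t) via eigendecomposition (assumes A is diagonalisable)."""
    w, V = np.linalg.eig(A)
    return ((V * np.exp(w * t)) @ np.linalg.inv(V)).real

def M3(R, size):
    """Evolution matrix for one chromosome in a sub-population of relative size `size`."""
    return np.array([[-R / 2, 1 / size],
                     [R / 2, -1 / size]])

def P_L(size, R, G):
    k = 2 + R * size
    return (2 + R * size * np.exp(-G * k / (2 * size))) / k

def P_B(size, R, G):
    k = 2 + R * size
    return (2 - 2 * np.exp(-G * k / (2 * size))) / k

def phi_state4(gamma, Gamma, R, G):
    """Probabilities of states 3, 5 and 8 at the split, eq. (B.19)."""
    p, q = P_L(gamma, R, G), P_L(Gamma, R, G)
    return np.array([p * q, p * (1 - q) + (1 - p) * q, (1 - p) * (1 - q)])

def contribution(phi, R, G):
    """Eq. (B.16)."""
    D = R**2 + 13 * R + 18
    return (1 + G) ** 2 * phi.sum() + ((R + 18) * phi[0] + 6 * phi[1] + 4 * phi[2]) / D

if __name__ == "__main__":
    gamma, Gamma, R, G = 0.3, 0.7, 2.5, 0.8
    print(P_L(gamma, R, G), P_B(gamma, R, G))
    print(contribution(phi_state4(gamma, Gamma, R, G), R, G))
```

Because the three probabilities sum to one, the result collapses to $(1 + G)^2$ plus a single fraction in $P_L(\gamma)$ and $P_L(\Gamma)$.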
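Since $\phi_1 + \phi_2 + \phi_3 = 1$ we have 

$$ e_4(\gamma, \Gamma) = (1 + G)^2 + \frac{4 + 2P_L(\gamma) + 2P_L(\Gamma) + (10 + R)\, P_L(\gamma) P_L(\Gamma)}{R^2 + 13R + 18}. \tag{B.20} $$

Similarly, we obtain 

$$ e_7(\gamma, \Gamma) = (1 + G)^2 + \frac{4 + 2P_L(\gamma) + 2P_B(\Gamma) + (10 + R)\, P_L(\gamma) P_B(\Gamma)}{R^2 + 13R + 18} \tag{B.21} $$

and 

$$ e_{10}(\gamma, \Gamma) = (1 + G)^2 + \frac{4 + 2P_B(\gamma) + 2P_B(\Gamma) + (10 + R)\, P_B(\gamma) P_B(\Gamma)}{R^2 + 13R + 18}. \tag{B.22} $$

Finally, starting from state 11, we obtain 

$$ e_{11}(\gamma, \Gamma) = \frac{4\, e^{-G/\gamma - G/\Gamma}}{18 + 13R + R^2} + \left[ \gamma + (1 - \gamma) e^{-G/\gamma} \right] \left[ \Gamma + (1 - \Gamma) e^{-G/\Gamma} \right]. \tag{B.23} $$

Calculation of $e_3, \ldots, e_{11}$ for the model introduced in section 4.3 

In this model, $\gamma = \Gamma = 1$ so the formulas simplify considerably. Starting from state 3, 
5 or 8, we obtain 

$$ e_3 = 1 + \frac{18 + R}{R^2 + 13R + 18}, \qquad e_5 = 1 + \frac{6}{R^2 + 13R + 18}, \qquad e_8 = 1 + \frac{4}{R^2 + 13R + 18}, \tag{B.24} $$

as calculated by Griffiths [26]. Starting from state 6 or 9, we obtain 

$$
\begin{aligned}
e_6 &= (1 + G) + \frac{(24 + 4R)\, e^{-G} + 2R\, e^{-G(6 + R)/2}}{(4 + R)(18 + 13R + R^2)}, \\
e_9 &= (1 + G) + \frac{(24 + 4R)\, e^{-G} - 8\, e^{-G(6 + R)/2}}{(4 + R)(18 + 13R + R^2)}.
\end{aligned}
\tag{B.27}
$$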

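Starting from state 4, 7 or 10, we obtain 

$$
\begin{aligned}
e_4 &= a + 8R\, b + R^2 c, \\
e_7 &= a + 4(R - 2)\, b - 2R\, c, \\
e_{10} &= a - 16\, b + 4c,
\end{aligned}
\tag{B.28}
$$

where 

$$
\begin{aligned}
a &= (1 + G)^2 - \frac{8}{(2 + R)^2} - \frac{21}{2 + R} + \frac{3(81 + 7R)}{18 + 13R + R^2}, \\
b &= \frac{6 + R}{(2 + R)^2 (18 + 13R + R^2)}\, e^{-G(2 + R)/2}, \\
c &= \frac{10 + R}{(2 + R)^2 (18 + 13R + R^2)}\, e^{-G(2 + R)}.
\end{aligned}
$$

Finally, starting from state 11 gives 

$$ e_{11} = 1 + \frac{4\, e^{-2G}}{18 + 13R + R^2}. \tag{B.29} $$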
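
The decomposition into $a$, $b$ and $c$ is a partial-fraction rewriting of the general expressions at $\gamma = \Gamma = 1$, and the two forms can be compared numerically. The sketch below is a hedged check, not the authors' code. It assumes the general form $(1 + G)^2 + [4 + 2P + 2Q + (10 + R)PQ]/(R^2 + 13R + 18)$, with $P$ and $Q$ equal to $P_L$ or $P_B$, the reading $e_4 = a + 8Rb + R^2 c$, and signs for $a$ fixed by the partial-fraction expansion. The function names are hypothetical.

```python
import math

def D(R):
    return R**2 + 13 * R + 18

def P_L(R, G):
    """Same-gamete probability at the split for gamma = 1."""
    return (2 + R * math.exp(-G * (2 + R) / 2)) / (2 + R)

def P_B(R, G):
    """Same-gamete probability at the split, starting on different gametes."""
    return (2 - 2 * math.exp(-G * (2 + R) / 2)) / (2 + R)

def e_general(P, Q, R, G):
    """General expression for states 4, 7 and 10 in terms of P and Q."""
    return (1 + G) ** 2 + (4 + 2 * P + 2 * Q + (10 + R) * P * Q) / D(R)

def abc(R, G):
    a = (1 + G) ** 2 - 8 / (2 + R) ** 2 - 21 / (2 + R) + 3 * (81 + 7 * R) / D(R)
    b = (6 + R) / ((2 + R) ** 2 * D(R)) * math.exp(-G * (2 + R) / 2)
    c = (10 + R) / ((2 + R) ** 2 * D(R)) * math.exp(-G * (2 + R))
    return a, b, c

def e4_e7_e10(R, G):
    a, b, c = abc(R, G)
    return (a + 8 * R * b + R**2 * c,
            a + 4 * (R - 2) * b - 2 * R * c,
            a - 16 * b + 4 * c)

if __name__ == "__main__":
    R, G = 1.5, 0.6
    print(e4_e7_e10(R, G))
    pl, pb = P_L(R, G), P_B(R, G)
    print(e_general(pl, pl, R, G), e_general(pl, pb, R, G), e_general(pb, pb, R, G))
```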
Glossary 

Locus A specific chromosomal location. 

Allele One of several alternative forms of a gene, or DNA sequence, at a locus. 

Genetic mosaic The pattern of differences between individuals in a population. 

Haplotype A block of closely linked alleles that are inherited together. Such alleles are 
often used as markers in the process of gene mapping. 

Linkage disequilibrium At linkage equilibrium, traits at different loci are inherited 
independently. Deviation from this is called linkage disequilibrium. 

Population bottleneck A drastic decrease in the abundance of a population, followed 
by a rapid increase in abundance. This may happen, e.g., when a small part of a 
population colonises a new environment, without extensive interbreeding with the 
main population. 

SNP Single nucleotide polymorphism. A difference in the genetic code at a single 
position. 

Markov process A stochastic process, where the future development depends only on 
the present state (no memory). 

Divergence When a population splits into two parts that do not interbreed, the 
independent accumulation of neutral mutations within each sub-population causes 
the number of genetic differences between individuals from different sub-populations 
to increase with time. 

Gene history The sequence of ancestors to a gene. 

Coalescent process An approximation of neutral evolution, valid for large populations. 

Chiasma process Exchange of genetic material between the copies of a chromosome pair 
during the production of gametes (egg or sperm cells). 

Recombination fraction The probability that two loci on the same chromosome were 
inherited from different parents. 

Figure 1. Gene history and polymorphic sites. a In DNA, genetic information is 
encoded by base-pairs of the four nucleic acids adenine (A), thymine (T), guanine (G), 
and cytosine (C). In a sample of three individuals, we show three polymorphic sites, 
with two nucleotides around each polymorphism. b The most common variation is 
a difference at a single position (SNP), caused by a mutation at the position in an 
individual in the history of the population, where e.g. a fraction of the population has 
the nucleotide T at the site, and the rest has the nucleotide A. The three mutations 
in panel a are shown as filled circles. Mutation 4 does not cause a polymorphism in 
the sample, since all individuals in the sample inherit the mutation from the common 
ancestor. Given $\tau$ (the number of generations since the most recent common ancestor) 
of a stretch of $L$ nucleotides, the number of differences between two individuals is 
assumed to be Poisson distributed with expected value $2\mu L \tau$, where $\mu$ is the mutation 
rate per site per generation [1]. c In recombination, part of a gamete (one of the 
two copies of a chromosome) is inherited from one parent and the rest from the other 
parent. We show a sample gene history with one recombination event, for two loci ($x$ 
and $y$) in two gametes $i$ and $j$. The time axis is the same as in panel b. The ancestral 
histories for loci $x$ and $y$ are shown in blue and red, respectively. The times until the 
most recent common ancestor are $\tau_{x(ij)}$ and $\tau_{y(ij)}$ for loci $x$ and $y$, respectively. In 
the absence of recombination, two loci on the same gamete share the same genetic 
history, and have the same time to the most recent common ancestor, $\tau_{x(ij)} = \tau_{y(ij)}$, 
causing linkage disequilibrium. If a recombination event occurs in the genetic history 
of a sample, it may lead to a decorrelation of $\tau_{x(ij)}$ and $\tau_{y(ij)}$. $x_i$ represents the genetic 
material at locus $x$ of chromosome $i$. Dashes correspond to genetic material not in the 
history of the sample, and the diamonds to common ancestral material. 

Figure 2. Models illustrating demographic history, i.e. changes in population 
size and structure. a Population bottleneck. b, c Models of population structure 
and expansion. d A more general model of demographic structure. e Demographic 
structure determining genetic variation in the laboratory-mouse genome [30] (time here 
is measured in years). 

Figure 3. A graph representation of the coalescent process for two loci ($x$ and $y$) and 
two chromosomes ($i$ and $j$). The transition rates (measured in units of $2N$ generations) 
between the different groups of states, corresponding to the table, are printed along 
the arrows ($R = 4Nr$). The process starts in state 1 and ends in state 5, the only 
absorbing state. If the path goes from state 1 to state 5 we have linkage, but if the 
system enters state 4 linkage is broken. Same notation as in figure 1. 

Figure 4. Correlation $\rho(\tau_{y(ij)}, \tau_{(y+x)(ij)})$ of gene histories as a function of the distance 
$x$ between them. The equations of the main text and the exact expressions from the 
appendix were used. In all cases, $r = 1.2$ cM/Mb, and $N$ and $\mu$ were 
chosen to be consistent with $2N \langle \tau \rangle = 1.55 \times 10^4$ and a coefficient of variation of 
0.94 [3] (except in the unstructured model). The lines are: the unstructured coalescent 
(dashed), bottleneck model with _ff = 0.1 (red), divergent model in figure 2b with 
$\gamma = 0.2$ (blue), and divergent model in figure 2c with $p = 0.3$ (green). Also shown are 
empirical estimates of lower and upper bounds for the correlation of gene histories in 
the human genome (squares) [3]. 

Figure 5. Comparison of $\rho(x)$ (markers) to $\rho(\tau_{y(ij)}, \tau_{(y+x)(ij)})$ (solid lines, calculated 
from theory), for an unstructured population (red) and a divergent population (blue). 
The estimator $\rho(x)$ was obtained from a single sample of 50 individuals, with 
$L_c = 10$ Mb, for different bin sizes $L = 100$ bp (diamonds), $L = 500$ bp (circles) and 
$L = 1$ kb (squares). The parameters for the divergent model are: $G = 0.6$, $p = 0.3$, 
$N = 6963.7$, $r = 0.95633$ cM/Mb, = 7.6 10""^. In the unstructured population model, 
the population size is = 10^. 

Table 1. The states of the Markov process of loci $x$ and $y$ in chromosomes $i$ and $j$, for 
the divergent population. For each state we show the corresponding configurations of 
the sub-populations, separated by a vertical bar. A dash denotes genetic material that 
is not ancestral to any locus in the sample. The symbol denotes a sub-population 
unrelated to the sample, and the diamonds denote a common ancestor to chromosomes $i$ 
and $j$ (for that locus). 